Data-driven mapping refers to the process of using data values to determine the symbology of mapped features. Color, shape, and size are the three most common graphic elements used to symbolize data-driven maps. Data-driven maps are often referred to as thematic maps. In this lesson we take a deep dive into data-driven mapping!
There are two primary types of thematic maps:
Choropleth maps, which set the color of areas (polygons) by data value
Point symbol maps, which set the color or size of points by data value
Many of the techniques for creating these maps can also be used with line data, although it is less common.
We review both of these types of maps in more detail in this lesson. First, let’s load the R libraries we will use.
library(sf)
library(tmap)
library(here)
Choropleth maps are the most common type of thematic map.
Let’s use an sf data.frame of California counties to make a choropleth map.
First, read in the counties data with the st_read function.
counties <- st_read(here("notebook_data",
"california_counties",
"CaliforniaCounties.shp"))
## Reading layer `CaliforniaCounties' from data source
## `/Users/pattyf/Documents/Dlab/workshops/AY2022/R-Geospatial-Fundamentals/notebook_data/california_counties/CaliforniaCounties.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -374445.4 ymin: -604500.7 xmax: 540038.5 ymax: 450022
## Projected CRS: NAD83 / California Albers
Then, make a basic map of our county boundaries.
plot(counties$geometry)
Now, take a look at the spatial dataframe.
head(counties)
## Simple feature collection with 6 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -267387.9 ymin: -578158.6 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
## NAME STATE_NAME POP2010 POP10_SQMI POP2012 POP12_SQMI WHITE BLACK
## 1 Kern California 839631 102.9 851089 104.282870 499766 48921
## 2 Kings California 152982 109.9 155039 111.427421 83027 11014
## 3 Lake California 64665 48.6 65253 49.082334 52033 1232
## 4 Lassen California 34895 7.4 35039 7.422856 25532 2834
## 5 Los Angeles California 9818605 2402.3 9904341 2423.264150 4936599 856874
## 6 Madera California 150865 70.1 153025 71.065672 94456 5629
## AMERI_ES ASIAN HAWN_PI HISPANIC OTHER MULT_RACE MALES FEMALES MED_AGE
## 1 12676 34846 1252 413033 204314 37856 433108 406523 30.7
## 2 2562 5620 271 77866 42996 7492 86344 66638 31.1
## 3 2049 724 108 11088 5455 3064 32469 32196 45.0
## 4 1234 356 165 6117 3562 1212 22416 12479 37.0
## 5 72828 1346865 26094 4687889 2140632 438713 4839654 4978951 34.8
## 6 4136 2802 162 80992 37380 6300 72682 78183 33.1
## AVE_HH_SZ AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1 3.15 3.61 284367 29757 152828 101782 06103
## 2 3.19 3.59 43867 2634 22329 18904 06089
## 3 2.39 2.94 35492 8944 17472 9076 06106
## 4 2.50 2.98 12710 2652 6590 3468 06086
## 5 2.98 3.58 3445076 203872 1544749 1696455 06073
## 6 3.28 3.63 49140 5823 27726 15591 06102
## geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...
In particular, we are interested in the columns with numeric values as these are the ones typically used to make data maps.
To get started, let’s create a choropleth map by setting the color of each county based on the value in the population per square mile column (POP12_SQMI).
Recall that sf’s plot method does this by default! So, here’s the quickest way to make a choropleth:
plot(counties['POP12_SQMI'])
By default, sf::plot linearly scales the colors to the data values. This is called a proportional color map.
A proportional color map will have a legend with a continuous color ramp rather than discrete data ranges.
A key benefit of a proportional color map is that it depicts the full range of data values without imposing any groupings.
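To make "linearly scales the colors" a bit more concrete, here is a minimal base R sketch. It rescales the six POP12_SQMI values printed by head(counties) to the 0-1 range and looks each one up on a continuous ramp. The two ramp colors and the variable names are assumptions for illustration only; this is not the palette or the internal code that sf uses.

```r
# POP12_SQMI values for the six counties shown in the head(counties) output
dens <- c(Kern = 104.282870, Kings = 111.427421, Lake = 49.082334,
          Lassen = 7.422856, `Los Angeles` = 2423.264150, Madera = 71.065672)

# Rescale linearly so the minimum maps to 0 and the maximum maps to 1
scaled <- (dens - min(dens)) / (max(dens) - min(dens))

# colorRamp() returns a function that interpolates between the given colors
ramp <- colorRamp(c("lightyellow", "darkred"))  # illustrative colors only
rgb_vals <- ramp(scaled)                        # one row of R, G, B per county
cols <- rgb(rgb_vals, maxColorValue = 255)      # convert to hex color strings
names(cols) <- names(dens)

round(scaled, 3)
cols
```

Notice that every county except Los Angeles lands very close to 0 on the rescaled range, so on a linear ramp they would all get nearly the same color.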
We can also use tmap to create thematic maps. This package gives us greater control over the visualization details so we can better explore the distribution of data values.
In tmap, instead of setting the col argument to the same static value for all features (e.g. ‘red’, ‘#ef03a5’), we can set it to the name of the column by which we want our polygons colored (e.g. ‘POP12_SQMI’).
# Set the mapping mode to a static plot (not interactive)
tmap_mode('plot')
## tmap mode set to plotting
# Map the county polygons colored by the values in the POP12_SQMI column
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
title = "Population Density per mi^2")
By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette for thematic maps and bins those colors into 3 to 7 classes of approximately equal intervals with rounded values for class breaks.
Of course, we can also use tmap’s interactive mapping mode. Do you recall the syntax for:
setting the tmap mode to static vs interactive mapping?
or toggling between these two modes?
Let’s make an interactive map, making our layer partially transparent, i.e. alpha = 0.5, so that we can see the basemap through our polygons.
tmap_mode('view')
## tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE) # force tmap to display invalid polygons
tm_shape(counties) +
tm_polygons(col='POP12_SQMI', alpha=0.5,
title = "Population Density per mi^2")
## Warning: The shape counties is invalid (after reprojection). See sf::st_is_valid
That’s really the heart of creating a choropleth map with tmap. To set the color of the features based on the values in a column, set the col argument to the name of that column in the sf data.frame, given as a string.
Before we move on, let’s use the st_make_valid function to fix the county geometry!
counties <- st_make_valid(counties)
Redo the map above, but mapping population (POP2012) NOT population density.
# Map of County Population (POP2012)
Question
Which map better conveys CA county population - POP12_SQMI or POP2012?
The goal of a thematic map is to use color to visualize the spatial distribution of a variable in order to identify trends and outliers.
Another goal is to use color to effectively and quickly convey information. For example,
maps use brighter or richer colors to signify higher values,
and leverage cognitive associations such as mapping water with the color blue.
There are two major challenges when creating thematic maps:
Our eyes are drawn to the color of larger areas or linear features, even if the values of smaller features are more significant.
The range of data values is rarely evenly distributed across all observations and thus the colors can be misleading.
Questions
Do you see either of these problems in our population-density map?
hist(counties$POP12_SQMI,
breaks = 40,
main = 'Population Density per mi^2')
There are three main techniques for dealing with these mapping challenges:
Color palettes
Data transformations
Classification schemes
There are three main types of color palettes (or color maps), each of which has a different purpose:
diverging - a “diverging” set of colors is used to emphasize mid-range values as well as extremes.
sequential - usually a single-hue or multi-hue color ramp that emphasizes differences in order and magnitude, where darker colors typically mean higher values
qualitative - a contrasting set of colors to identify distinct categories and avoid implying quantitative significance.
Tip: Sites like ColorBrewer let you play around with different types of color maps.
To see the names of all color palettes available to tmap, try the following command. You may need to enlarge the output image.
RColorBrewer::display.brewer.all()
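If you prefer a text listing over the image, the sketch below groups the ColorBrewer palette names by the three types described above. It assumes the RColorBrewer package is installed and relies on its brewer.pal.info table, whose category column uses the codes seq, div, and qual. The variable names are just for illustration.

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; row names are the palette names
info <- brewer.pal.info

# Group the palette names by category: "div", "qual", or "seq"
pal_types <- split(rownames(info), info$category)

pal_types$seq   # sequential palettes, e.g. for densities or rates
pal_types$div   # diverging palettes, e.g. for change above/below a midpoint
pal_types$qual  # qualitative palettes, e.g. for categories
```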
As a best practice, a qualitative color palette should not be used with quantitative data and vice versa. For example, consider this map that EDM.com published of top dance tracks by state.
For a number of reasons, data are often distributed in aggregated form. For example, the Census Bureau collects data from individual people, households and businesses and distributes it aggregated to states, counties, and census tracts, etc.
When the aggregated data are counts, like total population, they can be transformed to densities, proportions and ratios. These normalized variables are more comparable across regions that differ greatly in size.
Let’s consider this in terms of our data.
The basic cartographic rule is that when mapping data for areas that differ in size you rarely map counts since those differences in size make the comparison less valid or informative.
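Here is a small base R sketch of why this matters, using the housing columns for three counties copied from the head(counties) output above. It converts a count (VACANT) into a percentage of all housing units and compares the two rankings. The data frame and the pct_vacant column name are assumptions made for this illustration.

```r
# Housing counts for three counties, copied from the head(counties) output
housing <- data.frame(
  NAME      = c("Kern", "Lake", "Los Angeles"),
  HSE_UNITS = c(284367, 35492, 3445076),
  VACANT    = c(29757, 8944, 203872)
)

# Normalize the count by the total number of units to get a percentage
housing$pct_vacant <- housing$VACANT / housing$HSE_UNITS * 100

# Ranked by count: the largest county comes first
housing[order(-housing$VACANT), c("NAME", "VACANT")]

# Ranked by percentage: the order changes
housing[order(-housing$pct_vacant), c("NAME", "pct_vacant")]
```

Los Angeles has by far the most vacant units, but Lake has the highest share of vacant units. A map of the raw counts would mostly show which counties are big.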
Another way to make more meaningful maps is to improve the way in which data values are associated with colors.
The common alternative to a proportional color map is to use a classification scheme to create a graduated color map. This is the standard way to create a choropleth map.
A classification scheme is a method for binning continuous data values into 4-7 classes (the default is 5) and then associating those classes with the different colors in a color palette.
Classification schemes can be implemented using the tmap geometry functions (tm_polygons, tm_dots, etc.) by setting a value for the style argument.
Here are some of the tmap keyword names for the different classification styles (see the documentation: ?tm_polygons):
equal, quantile, fisher, jenks, headtails, fixed, kmeans, pretty.
For more information about classification schemes see ?classIntervals or sources such as this page in the Geocomputation with R ebook.
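To get a feel for how two of these schemes differ, the base R sketch below computes equal-interval and quantile breaks by hand for the six POP12_SQMI values shown earlier and counts how many counties fall in each class. This is a simplified approximation of the idea behind the equal and quantile styles, not the code that tmap or classInt runs, and three classes is an arbitrary choice for such a tiny sample.

```r
# POP12_SQMI values for the six counties shown in the head(counties) output
dens <- c(104.282870, 111.427421, 49.082334, 7.422856, 2423.264150, 71.065672)
n_classes <- 3

# Equal interval: split the data range into classes of the same width
equal_breaks <- seq(min(dens), max(dens), length.out = n_classes + 1)

# Quantile: choose breaks so each class holds about the same number of values
quantile_breaks <- quantile(dens, probs = seq(0, 1, length.out = n_classes + 1))

# Count the observations per class under each scheme
equal_counts <- table(cut(dens, breaks = equal_breaks, include.lowest = TRUE))
quantile_counts <- table(cut(dens, breaks = quantile_breaks, include.lowest = TRUE))

equal_counts
quantile_counts
```

With skewed data like this, equal intervals put five of the six counties in the lowest class and leave the middle class empty, while quantile breaks spread the counties evenly across classes.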
Let’s redo the previous map using the quantile classification scheme.
tmap_mode('plot')
## tmap mode set to plotting
# Plot population density - mile^2
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
style = "quantile",
alpha = 0.5,
title = "Population Density per mi^2")
Redo the previous map with these classification schemes: headtails, equal, jenks
You may get pretty close to your final map without being completely satisfied. In this case you can manually define a classification scheme.
Let’s customize our map with a user-defined classification scheme where we manually set the breaks for the bins by setting style = 'fixed' and passing our own values to the breaks argument.
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
palette = "YlGn",
style = 'fixed',
breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
title = "Population Density per Square Mile")
Since we are customizing our plot, we can also edit our legend to specify the text, so that it’s easier to read.
We can use tm_add_legend to build our own customized legend.
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
palette = "YlGn",
style='fixed',
breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
legend.show = F) +
tm_add_legend('fill', col = RColorBrewer::brewer.pal(6, "YlGn"),
border.col = "black",
title = "Population Density per Sq Mile",
labels = c('<50','50 to 100','100 to 200','200 to 300','300 to 400','>400'))
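The relationship between the breaks vector and the legend labels can be checked outside of tmap. The base R sketch below bins the six POP12_SQMI values shown earlier with cut(), reusing the same breaks and labels as the map. Note one assumption: cut() closes intervals on the right by default, which may not match how tmap treats values that fall exactly on a break.

```r
# POP12_SQMI values for the six counties shown in the head(counties) output
dens <- c(Kern = 104.282870, Kings = 111.427421, Lake = 49.082334,
          Lassen = 7.422856, `Los Angeles` = 2423.264150, Madera = 71.065672)

# Same breaks and labels as in the map code above
breaks <- c(0, 50, 100, 200, 300, 400, max(dens))
labels <- c('<50', '50 to 100', '100 to 200', '200 to 300', '300 to 400', '>400')

# Each bin needs a lower and an upper edge, so there is one more break than labels
stopifnot(length(labels) == length(breaks) - 1)

# Assign every county to its bin
bins <- cut(dens, breaks = breaks, labels = labels, include.lowest = TRUE)
data.frame(density = dens, bin = bins)
```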
If we look at the columns in our dataset, we see we have a number of variables from which we can calculate proportions, rates, and the like.
Let’s try that out:
head(counties)
## Simple feature collection with 6 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -267387.9 ymin: -578158.6 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
## NAME STATE_NAME POP2010 POP10_SQMI POP2012 POP12_SQMI WHITE BLACK
## 1 Kern California 839631 102.9 851089 104.282870 499766 48921
## 2 Kings California 152982 109.9 155039 111.427421 83027 11014
## 3 Lake California 64665 48.6 65253 49.082334 52033 1232
## 4 Lassen California 34895 7.4 35039 7.422856 25532 2834
## 5 Los Angeles California 9818605 2402.3 9904341 2423.264150 4936599 856874
## 6 Madera California 150865 70.1 153025 71.065672 94456 5629
## AMERI_ES ASIAN HAWN_PI HISPANIC OTHER MULT_RACE MALES FEMALES MED_AGE
## 1 12676 34846 1252 413033 204314 37856 433108 406523 30.7
## 2 2562 5620 271 77866 42996 7492 86344 66638 31.1
## 3 2049 724 108 11088 5455 3064 32469 32196 45.0
## 4 1234 356 165 6117 3562 1212 22416 12479 37.0
## 5 72828 1346865 26094 4687889 2140632 438713 4839654 4978951 34.8
## 6 4136 2802 162 80992 37380 6300 72682 78183 33.1
## AVE_HH_SZ AVE_FAM_SZ HSE_UNITS VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1 3.15 3.61 284367 29757 152828 101782 06103
## 2 3.19 3.59 43867 2634 22329 18904 06089
## 3 2.39 2.94 35492 8944 17472 9076 06106
## 4 2.50 2.98 12710 2652 6590 3468 06086
## 5 2.98 3.58 3445076 203872 1544749 1696455 06073
## 6 3.28 3.63 49140 5823 27726 15591 06102
## geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...
Let’s calculate the percent of the population that is Hispanic and save it to a new column. Then, we can use that to create a choropleth map.
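# Calculate percent Hispanic as a new column.
# The race columns (WHITE, BLACK, ...) sum to POP2010, which suggests these are
# 2010 counts, so we divide by the 2010 population.
counties$pct_hispanic <- counties$HISPANIC / counties$POP2010 * 100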
# Plot percent hispanic as choropleth
tm_shape(counties) +
tm_polygons(col = 'pct_hispanic',
palette = 'Blues',
style = 'fixed',
breaks = c(0, 20, 40, 60, 80, 100),
border.col = "darkgrey",
lwd = 1.5,
legend.show = F) +
tm_add_legend('fill', col = RColorBrewer::brewer.pal(5, "Blues"),
border.col = "darkgrey",
title = "Percent Hispanic Population",
labels = c('<20%',
'20% - 40%',
'40% - 60%',
'60% - 80%',
'80% - 100%'))
Questions
What new options and operations have we added to our code?
How many values do we specify in the breaks vector, and how many bins are in the map legend? Why?
Choropleth maps are great, but point maps enable us to visualize our spatial data in another way.
If you know both mapping methods you can expand how much information you can show in one map.
For example, point maps are a better way to map counts because the varying sizes of areas are deemphasized.
We can use the sf::st_centroid function to dynamically transform the county polygons to their centroids (point centers).
We then use the tm_dots element to create point maps dynamically from polygon data! Let’s take a look.
# County population counts as a point map!
tmap_mode('plot')
## tmap mode set to plotting
# Add the county polygon borders as a basemap
tm_shape(counties) +
tm_borders(col = "grey") +
# Then map the county centroids as points colored by population counts
tm_shape(st_centroid(counties)) +
tm_dots(col = 'POP2012',
palette = 'YlOrRd',
style = 'jenks',
border.col = "black", # dot borders only visible in interactive mode!
border.lwd = 1,
border.alpha = 1,
size = .5,
legend.show = T)
## Warning in st_centroid.sf(counties): st_centroid assumes attributes are constant
## over geometries of x
This is another useful type of data transformation for making effective maps.
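If you want to see what st_centroid returns without loading the counties, here is a minimal sketch on a toy polygon. The rectangle, its attribute values, and the object names are all made up for illustration. suppressWarnings() is used only to silence the same "attributes are constant" warning shown above.

```r
library(sf)

# A toy 4 x 2 rectangle standing in for a county polygon (hypothetical data)
rect <- st_polygon(list(rbind(c(0, 0), c(4, 0), c(4, 2), c(0, 2), c(0, 0))))
toy <- st_sf(name = "toy county", pop = 1000, geometry = st_sfc(rect))

# Replace the polygon geometry with its centroid; attribute columns are kept
toy_pts <- suppressWarnings(st_centroid(toy))

st_geometry_type(toy_pts)  # geometry is now a point
st_coordinates(toy_pts)    # center of the rectangle
toy_pts$pop                # the attribute carried over unchanged
```

Because the attributes travel with the new point geometry, the centroids can be symbolized by any column of the original polygons.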
Let’s read in some data that is more typically encoded with point geometry - Alameda County schools.
schools_df <- read.csv(here("notebook_data",
"alco_schools.csv"))
head(schools_df)
## X Y Site Address City
## 1 -122.2388 37.74476 Amelia Earhart Elementary 400 Packet Landing Rd Alameda
## 2 -122.2519 37.73900 Bay Farm Elementary 200 Aughinbaugh Way Alameda
## 3 -122.2589 37.76206 Donald D. Lum Elementary 1801 Sandcreek Way Alameda
## 4 -122.2348 37.76525 Edison Elementary 2700 Buena Vista Ave Alameda
## 5 -122.2381 37.75396 Frank Otis Elementary 3010 Fillmore St Alameda
## 6 -122.2616 37.76911 Franklin Elementary 1433 San Antonio Ave Alameda
## State Type API Org
## 1 CA ES 933 Public
## 2 CA ES 932 Public
## 3 CA ES 853 Public
## 4 CA ES 927 Public
## 5 CA ES 894 Public
## 6 CA ES 893 Public
We read this data from a plain CSV file, so let’s promote it to an sf data.frame.
schools_sf <- st_as_sf(schools_df,
coords = c('X','Y'),
crs = 4326)
Then we can map it.
plot(schools_sf)
What is useful about the above display of the maps for each column in the dataframe is that at a glance you can see the type of data variable and get a sense of the range of values.
The default sf::plot point map for a numeric data column is a proportional color map that linearly scales the color of the point symbol by the data values.
# Point map of API - Academic Performance Index
plot(schools_sf['API'])
Let’s try creating the same map with tmap.
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(schools_sf) +
tm_dots(col = "API")
The basic tmap graduated color map needs some customization to shine, especially in plot mode!
We can add the outline of Alameda County to improve our map.
tmap_mode('plot')
## tmap mode set to plotting
# Add Alameda County outline
tm_shape(counties[counties$NAME=='Alameda',]) +
tm_borders() +
# Add Alameda County Schools
tm_shape(schools_sf) +
tm_dots(col = "API") +
# Position the legend in the bottom left corner
tm_legend(position=c("left", "bottom"))
By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette and the pretty classification scheme for point thematic maps. These are the same defaults that are used for tmap choropleth maps. Note, point maps that symbolize data values by color are called Graduated Color Maps. In spite of the different map names, the color and classification scheme options are almost identical in tmap! However, some options will be different - for example, a size parameter makes sense for a point radius but not a polygon!
See ?tm_dots for more information about the options for customizing point maps! For example…
# API Graduated Color Map
# Add the county polygon
tm_shape(counties[counties$NAME=='Alameda',]) +
tm_polygons(col="lightgrey") +
# Add the Schools
tm_shape(schools_sf[order(schools_sf$API),]) +
tm_dots(col = 'API',
size = 0.15,
palette = 'Reds',
style = 'fixed',
breaks = c(0, 200, 400, 600, 800, 1000),
border.col = 'grey',
legend.show = F) +
# Customize the legend
tm_add_legend('fill',
title = 'Alameda County, school API scores',
labels = c('<200',
'[200,400)',
'[400,600)',
'[600,800)',
'>800'),
col = RColorBrewer::brewer.pal(5, "Reds")) +
# position the legend
tm_layout(legend.position = c('right', 'top'))
Another important type of point map is the proportional symbol map. These are like proportional color maps but instead of associating symbol color with data values they associate symbol size. You can make these in tmap with the tm_bubbles function.
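One design question with proportional symbols is whether the data value should drive the radius or the area of the circle. The base R sketch below compares the two, using the four free/reduced lunch counts that appear in the printed outputs later in this lesson. It is only an illustration of the arithmetic, not a description of how tm_bubbles computes sizes, and max_radius is an arbitrary value.

```r
# Free/reduced lunch counts that appear in the printed outputs of this lesson
rlunch <- c(111, 276, 372, 642)
max_radius <- 1  # radius given to the largest value (arbitrary units)

# Option 1: area proportional to the value, so radius grows with the square root
radius_by_area <- max_radius * sqrt(rlunch / max(rlunch))

# Option 2: radius proportional to the value, so area grows with the square
radius_by_value <- max_radius * rlunch / max(rlunch)

data.frame(
  rlunch,
  radius_by_area  = round(radius_by_area, 2),
  radius_by_value = round(radius_by_value, 2),
  # relative area of each circle under option 2 (largest circle = 1)
  area_if_radius_scaled = round(radius_by_value^2, 2)
)
```

Scaling the radius directly makes small values look much smaller than they are, because readers tend to compare the areas of the circles.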
The schools data does not contain any good variables for proportional symbol mapping, so we will read in a supplemental file of National Center for Education Statistics (NCES) data and join it to the school points.
df <- read.csv(here("notebook_data",
"other",
"PolicyMap_NCES_Data_20210429.csv"))
# take a look
head(df, 2)
## School.Name
## 1 California School for the Deaf-Fremont
## 2 Dublin Elementary
## Education.Agency
## 1 California School for the Deaf-Fremont (State Special Schl)
## 2 Dublin Unified
## School.Location City State Zip Type.of.School
## 1 39350 Gallaudet Dr. Fremont CA 94538 Special Education School
## 2 7997 Vomac Rd. Dublin CA 94568 Regular School
## School.status.at.time.of.last.report School.Level County
## 1 Open Other Alameda County
## 2 Open Primary Alameda County
## Grades.Offered Total.Students
## 1 Kindergarten, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 375
## 2 Kindergarten, 1, 2, 3, 4, 5 894
## Full.time.equivalent.Classroom.Teachers Student.Teacher.Ratio
## 1 73.91 5.07
## 2 33.60 26.61
## National.School.Lunch.Program.Status
## 1 No
## 2 No
## Free.and.Reduced.price.Lunch.Eligible.Students Title.I.Eligible
## 1 372 Unknown
## 2 111 Title I Eligible School
## School.wide.Title.I Title.I.Status
## 1
## 2 Not School-wide Title I Title I targeted assistance school
## GreatSchools.Link
## 1 http://www.greatschools.net/modperl/browse_school/CA/2297
## 2 http://www.greatschools.net/modperl/browse_school/CA/55
## NCES.Public.School.ID Urban.centric.Locale Magnet.School Charter.School
## 1 60000310347 Suburban, Large
## 2 60001906929 Suburban, Large
## Shared.time.School Virtual.School Reconstituted Total.Students.1
## 1 NA NA 375
## 2 NA NA 894
## Prekindergarten.Students Kindergarten.Students Grade.1.Students
## 1 NA 36 26
## 2 NA 350 300
## Grade.2.Students Grade.3.Students Grade.4.Students Grade.5.Students
## 1 24 22 40 32
## 2 296 294 290 258
## Grade.6.Students Grade.7.Students Grade.8.Students Grade.9.Students
## 1 60 48 52 84
## 2 NA NA NA NA
## Grade.10.Students Grade.11.Students Grade.12.Students
## 1 82 92 152
## 2 NA NA NA
## High.School.Students.Earning.College.Credit.or.CTE.Students.Beyo
## 1 NA
## 2 NA
## Adult.Education.Students Ungraded.Students White..Non.Hispanic.Students
## 1 0 NA 174
## 2 0 NA 582
## Hispanic.Students Black..Non.Hispanic.Students Asian..Non.Hispanic.Students
## 1 382 58 88
## 2 296 50 658
## American.Indian.Alaska.Native.Students
## 1 NA
## 2 12
## Hawaiian.Native.Pacific.Islander.Students Two.or.More.Races.Students
## 1 4 44
## 2 20 170
## Percent.White..Non.Hispanic Percent.Hispanic Percent.Black..Non.Hispanic
## 1 23.20 50.93 7.73
## 2 32.55 16.55 2.80
## Percent.Asian..Non.Hispanic Percent.American.Indian.Alaska.Native
## 1 11.73 NA
## 2 36.80 0.67
## Percent.Hawaiian.Native.Pacific.Islander Percent.Two.or.More.Races
## 1 0.53 5.87
## 2 1.12 9.51
## All.Female.High.School Majority.Minority.High.School Inner.City.High.School
## 1
## 2
Subset to keep only a few columns
df2 <- df[c('School.Name',
'Student.Teacher.Ratio',
'Free.and.Reduced.price.Lunch.Eligible.Students')]
# Rename the columns
colnames(df2) <- c('Site',
'STRatio',
'RLunch')
# take a look
head(df2, 2)
## Site STRatio RLunch
## 1 California School for the Deaf-Fremont 5.07 372
## 2 Dublin Elementary 26.61 111
Join these columns to the schools sf spatial dataframe.
schools_sf2 <- merge(schools_sf, df2, by = "Site")
# take a look
head(schools_sf2, 2)
## Simple feature collection with 2 features and 9 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -122.2273 ymin: 37.75293 xmax: -122.1861 ymax: 37.78232
## Geodetic CRS: WGS 84
## Site Address City State Type API Org
## 1 Achieve Academy 1700 28th Avenue Oakland CA ES 788 Public
## 2 ACORN Woodland Elementary 1025 81st Avenue Oakland CA ES 782 Public
## STRatio RLunch geometry
## 1 20.99 642 POINT (-122.2273 37.78232)
## 2 23.08 276 POINT (-122.1861 37.75293)
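One thing to keep in mind about merge is that, by default, it keeps only the rows whose key appears in both tables. The sketch below shows this with two tiny plain data frames. The school names and values are hypothetical, and it uses ordinary data frames rather than sf objects, so treat it as an illustration of the join logic only.

```r
# Two toy tables sharing a "Site" key (hypothetical names and values)
schools <- data.frame(Site = c("School A", "School B", "School C"),
                      API  = c(800, 750, 900))
nces <- data.frame(Site   = c("School A", "School C", "School D"),
                   RLunch = c(120, 45, 300))

# Default merge: only sites present in both tables are kept
inner <- merge(schools, nces, by = "Site")

# all.x = TRUE keeps every row of the first table, filling gaps with NA
left <- merge(schools, nces, by = "Site", all.x = TRUE)

inner
left
```

If the joined table has fewer rows than the original, comparing nrow() before and after the merge is a quick way to see how many records did not find a match.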
Now we can create a map using the Free/Reduced lunch values.
tmap_mode('plot')
## tmap mode set to plotting
# Add the county polygon
tm_shape(counties[counties$NAME=='Alameda',]) +
tm_polygons(col="lightgrey") +
tm_shape(schools_sf2) +
tm_bubbles(size = "RLunch",
col = "pink",
border.col = 'black',
title.size = "Students Eligible for Free/Reduced Lunch") +
tm_layout(legend.position = c('right', 'top'))
Question
What does this code do?
ttm()
tmap_last()
Mapping categorical data, also called qualitative data, is a bit more straightforward. There is no need to scale or classify data values. The goal of the color map is to provide a contrasting set of colors so as to clearly delineate different categories. Here’s a point-based example:
# Add the county polygon
tm_shape(counties[counties$NAME=='Alameda',]) +
tm_polygons(col="lightgrey") +
tm_shape(schools_sf) +
tm_dots(col = 'Org',
size = 0.15,
palette = 'Set1', # a qualitative palette, since Org is categorical
title = "School Type") +
tm_layout(legend.position = c('left', 'bottom'))
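We learned about important data-driven mapping strategies and mapping concepts, including choropleth maps, point symbol maps, color palettes, data transformations, and classification schemes.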
Points and polygons are not the only geometry types that we can use in data-driven mapping! You can also map linear features by associating data values with the color, shape, and size of features, but these types of maps are less common.
Practice creating choropleth and graduated color maps with the counties data. Pick one quantitative variable like MED_AGE and try different color palettes and classification schemes.
Then, try the following:
Create a tmap plot of county polygons colored by ‘MED_AGE’
Overlay county centroids, colored by ‘AVE_HH_SZ’
Does the map suggest any relationship between these two variables?
Set the map to interactive mode and redraw it with tmap_last()
# Your code here